ABSTRACT

In this paper, we report on indexing performance by a state- 
of-the-art keyphrase indexer, Maui, when paired with a text 
extraction procedure called text denoising. Text denoising 
is a method that extracts the denoised text, comprising the 
content-rich sentences, from full texts. The performance 
of the keyphrase indexer is demonstrated on three standard 
corpora collected from three domains, namely food and agri- 
culture, high energy physics, and biomedical science. Maui is 
trained using the full texts and denoised texts. The indexer, 
using its trained models, then extracts keyphrases from test 
sets comprising full texts, and their denoised and noise parts 
(i.e., the part of texts that remains after denoising).
Experimental findings show that, against a gold standard, the
denoised-text-trained indexer indexing full texts performs
either better than or as well as the benchmark, a
full-text-trained indexer indexing full texts.

Categories and Subject Descriptors 

H.3.1 [Information Storage and Retrieval]: Content 
Analysis and Indexing — Indexing method; H.3.3 [Information 
Storage and Retrieval] : Information Search and Retrieval; 

H.3.4 [Information Storage and Retrieval]: Systems
and Software — Performance evaluation (efficiency and ef- 
fectiveness) 

General Terms 

Experimentation, Performance. 

Keywords 

Keyphrase extraction, topic extraction, indexing, text de- 
noising, keyphrase indexer, machine learning model, fog in- 
dex 

1. INTRODUCTION

Since they provide high-level descriptions of document con- 
tents, keyphrases serve as the meta-descriptions as well as a 
means to effective document retrieval from digital libraries. 
Other reasons to use keyphrases include but are not limited 
to document similarity measure, classification and cluster- 
ing, topic search, web tag clouds and document summariza- 
tion [H]. 

Today, automatic keyphrase indexing is a popular approach
that avoids several drawbacks of manual indexing, such as
the time and effort it demands and a poor choice of keyphrases.
Among the automatic keyphrase indexers, several are tested
across domains [5] [6] [9] [11] [TI] [16] while many are
domain-specific [5]. Most of these indexers are trained with full
documents using algorithms like Naive Bayes and Bagging to
extract keyphrases from full-text test documents. A revealing
experiment by Witten et al. [5] demonstrates that the
performance of the indexers depends not only on their
features but also on document size. When they apply their
full-text trained Keyphrase Extraction Algorithm (hereinafter, KEA)
to paper abstracts and compare against a gold standard,
they find its performance on these reduced texts inferior
to that on full texts. The authors concluded that this drop
was to be expected, as fewer author-assigned keyphrases
appear in the chosen reduced texts than in the entire document.

Text Denoising is a method proposed by Shams and Mer- 
cer [14] which reduces the amount of text in biomedical pa- 
pers to 30% of the original. This 30% of the text is selected 
based on the Fog Index readability score [2] and is called 
denoised text; the remaining text is called the noise text. 
In this introductory work denoised text is shown to be the 
more content-rich portion of the full text as it contains most 
of the biomedical concepts that are explicitly or implicitly 
connected with biomedical relations. Although tests have
been carried out only with biomedical research articles, the
authors conclude that the Fog Index can be a useful indicator
of content richness for other genres and purposes, though
the threshold of 30% might need to be reconsidered.

In this paper, we report on the performance of a state-of- 
the-art keyphrase indexer named Maui [9] when paired with 
text denoising. We use three standard full text corpora from 
the food and agriculture, high energy physics, and biomedi- 
cal science domains. From each corpus, we develop training 
sets comprising full texts and their denoised parts. The test 
sets are composed of full texts, and their denoised and noise 
parts. For training and testing each dataset, we use a
standard 10-fold cross validation. We show experimentally that
although a threshold of 30% performs well for biomedical
relation extraction, the appropriate threshold for keyphrase
indexing is 70%. To evaluate Maui, we use quantitative
measures like precision, recall and F-score as well as
qualitative measures like inter-indexer agreements.
Experimental results show that Maui, with denoised texts,
performs either better than or comparably to its benchmark
performance, i.e., that of full-text trained models
extracting keyphrases from full-text test sets.

The remainder of this paper provides background on text
denoising and the keyphrase indexer, Maui (Section 2), and
discusses the methods for training and testing the indexer
(Section 3), an analysis of the results (Section 4), and ends
with some concluding remarks (Section 5).

2. BACKGROUND 

In this section, we briefly discuss the text denoising method 
as well as Maui, the keyphrase indexer. 

2.1 Text Denoising 

In their paper, Witten et al. [5] demonstrated that the
performance of the indexer KEA drops when
extracting keyphrases from paper abstracts. Similarly, the
performance of biomedical relation miners that attempt to 
extract relations among drugs, chemicals, diseases, genes 
and proteins from paper abstracts is such that a number of 
biomedical ontologies like OMIM (Online Mendelian Inher- 
itance in Man) and GO (Gene Ontology) use human anno- 
tators to extract relations from full texts. This procedure is 
time-consuming as well as error-prone. To overcome these 
shortcomings, Shams and Mercer [14] proposed a method 
that identifies those areas within a text, called denoised text, 
where content information, such as biomedical relations, is 
more likely to occur. The authors suggested that the de- 
scribing of biomedical relations lengthens sentences and in- 
creases the use of polysyllabic words. Some readability in- 
dexes, the Fog Index [4] in particular, are based on these two 
factors. They proceeded to use the Fog Index to measure
sentence readability and showed experimentally that the 30% of
the sentences with the lowest readability, the denoised
part of a text, contained the relations of interest.
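
The selection step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation of Shams and Mercer [14]: it assumes the standard Gunning Fog formula, 0.4 x (average sentence length + percentage of words with three or more syllables), applied to one sentence at a time, and it estimates syllables with a crude vowel-group heuristic. The names count_syllables, fog_index and denoise are hypothetical.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: number of vowel groups (heuristic assumption)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def fog_index(sentence):
    """Gunning Fog Index of a single sentence.

    With one sentence, the average sentence length is its word count.
    """
    words = re.findall(r"[A-Za-z]+", sentence)
    if not words:
        return 0.0
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    return 0.4 * (len(words) + 100.0 * complex_words / len(words))


def denoise(sentences, threshold=0.30):
    """Split sentences into (denoised, noise).

    The `threshold` fraction of sentences with the highest Fog Index
    (i.e., the lowest readability) is kept as denoised text; both parts
    preserve the original sentence order.
    """
    keep = round(len(sentences) * threshold)
    ranked = sorted(range(len(sentences)),
                    key=lambda i: fog_index(sentences[i]),
                    reverse=True)
    kept = set(ranked[:keep])
    denoised = [s for i, s in enumerate(sentences) if i in kept]
    noise = [s for i, s in enumerate(sentences) if i not in kept]
    return denoised, noise
```

Raising the threshold argument from 0.30 to 0.70 gives the setting used later for keyphrase indexing.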

Figure 1 illustrates the text denoising method applied to
biomedical texts. Text Denoising has been evaluated with 
a corpus comprising 24 full texts that describe four related 
pairs of disease and chemical components. This method ex- 
tracted pairs of biomedical concepts from the denoised part 
of the texts of which about 75 percent are reported as re- 
lated according to the Unified Medical Language System's 
(UMLS) semantic relations network. It is noteworthy that 
the rest of the text, called noise text, did not contain any 
related biomedical concepts of interest. 
Figure 1: Text denoising and connected concept extraction
method described by Shams and Mercer [14]

2.2 Maui

Automatic keyphrase indexing has been in practice for half
a century [8], but until recently the performance was not
notable. Maui¹ [9] is the final successor of a legacy of keyphrase
indexers and inherits from and builds upon both of its
predecessors, KEA [H][16] and KEA++² [11]. Maui uses 13
features to develop machine learning models and extract
keyphrases. Among them are tf×idf and first occurrence (i.e.,
the number of words preceding a keyphrase normalized by
the total number of words in a document), inherited from
KEA, and node degree (i.e., the number of connections between
a candidate phrase and the other candidates in the SKOS
hierarchy) and keyphrase length, inherited from KEA++.
Maui's own features include tf, idf, last occurrence, spread
(i.e., the number of words between the first and last occurrence
of a phrase), semantic relatedness and generality (i.e., position
in the vocabulary hierarchy).

Furthermore, Maui uses a feature named keyphraseness (i.e.,
the probability that a candidate phrase is present in the
training documents) for domain-specific keyphrase extraction.
Frank et al. [5] reported that the use of this feature
increases indexers' performances if their training and
test documents are retrieved from the same domain. Maui,
moreover, can be incorporated with any domain-specific
controlled vocabulary written in SKOS³ (Simple Knowledge
Organization System) hierarchical format. Besides
domain-specific keyphrase indexing, Maui has the capability for both
free-text indexing and indexing for any domain that lacks
a controlled vocabulary. For the latter case, Maui uses
Wikipedia as a domain-independent controlled vocabulary.

Maui has been evaluated with texts from three different
domains: food and agriculture, nuclear physics, and biomedical
science. Although the performance is not comparable to
that of English, it has also been tested with texts written
in French and Spanish. Experimental outcomes show that
Maui outperforms its predecessors in both cases [9].
Moreover, it performs significantly better than Medical Text
Indexer (MTI)⁴ [3] and BibClassify⁵ [12] for indexing
biomedical and physics texts, respectively.

3. METHODOLOGY

In this section, we describe the datasets, training and testing
procedure, performance measures, and the means to find the
appropriate text denoising threshold for keyphrase indexing.

We use the datasets and follow the experimental protocols
set by Medelyan [9] except that we train Maui not only on
full texts but also on their denoised parts and test it on full
texts as well as their denoised and noise parts.

1 http://code.google.com/p/maui-indexer/



3.1 Datasets 

In this experiment, to train and test Maui, we use three stan- 
dard corpora of full texts and keyphrases associated with 
them from three different domains. These corpora were col- 
lected by Medelyan [9] during her doctoral research. 

The first dataset, which is referred to as FAO-780, contains
780 full-text documents and their keyphrases. The documents
have been selected randomly from the Food and Agriculture
Organization (FAO) data repository. The dataset
contains about 24 million words (30,800 words on average
per document) and 6,225 keyphrases (8 keyphrases on
average per document, ranging from 2 to 23). An in-depth
analysis of the documents reveals that the set is composed
of both research articles and experimental reports.

2 http://www.nzdl.org/Kea/

3 http://www.w3.org/2001/sw/wiki/SKOS/Datasets

4 http://skr.nlm.nih.gov/SKR_API/index.shtml

5 http://invenio-demo.cern.ch/help/hacking/bibclassify-

Figure 2: Keyphrase extraction from k-th fold

Our second dataset comprises 290 full-text documents and
their keyphrases on high energy physics randomly collected
from the European Organization for Nuclear Research
(abbreviated as CERN) document server and thus named
CERN-290. Each document contained therein has an average of
6,300 words and 7 keyphrases. CERN-290 is the smallest
dataset that we use in this experiment and contains mainly
experimental reports.

We also used the NLM-500 corpus, collected by the NLM In- 
dexing Initiative [2] during the development of MTI, which 
consists of 500 biomedical research articles. This corpus 
has documents with average length of 4,500 words and an
average number of assigned keyphrases of 15 ranging from 
2 to 30. The contents of the dataset are mainly schol- 
arly research articles collected from the National Library 
of Medicine repository. 



3.2 Training and Testing 

In our first attempts at pairing Maui with reduced texts, we
noted that Witten et al. [16], using the Computer Science
Technical Reports (CSTR) corpus, showed that enlarging the
training set beyond 25 documents has very little effect
on the indexer's performance. In our initial experiments
with Maui we followed this protocol by randomly choosing
25 training and 100 test documents from the NLM-500
corpus. Against a gold standard (author-assigned keyphrases
for the 100 test documents), we measured Maui's precision,
recall and F-score. From these experiments, we have seen
that



• although the performance of Maui with the denoised text 
trained model is better than that with full-text trained 
model, the improvement is not statistically significant and 
the improvement does not reflect on the entire population, 
and 

• Maui's performance improves if we increase the text de- 
noising threshold from 30% to 40%. 



The first observation indicates that the methods followed by 
Witten et al. [16] can be effective for certain domains but 
are not an effective means for many others while the lat- 
ter indicates that for keyphrase indexing, the text denoising 
threshold is not 30%. 

Therefore, we decided to use a more conventional k-fold
experimental approach. We followed the experimental
procedure illustrated in Figure 2. We consider the full texts from
each dataset and divide them randomly into 10 equal-sized
folds where the documents in one fold do not overlap with
the others. In addition, we keep the denoised and noise parts
of each fold. Then, we apply a standard 10-fold cross
validation to train and test Maui. To generate each
training/testing pair, we keep one of the 10 folds out as our
testing set and combine the remaining 9 folds as our training
set. Doing this 10 times, each time leaving out a different
one of the 10 folds as a testing set, we get 10 pairs. We train
Maui on the training sets comprising full texts and denoised
texts from each pair. In this way, we develop 20 trained
models for the entire 10 folds. The models the indexers
develop from full texts are called full-text trained models
and those that are developed from denoised texts are called
denoised text trained models.
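
The fold construction above can be sketched as follows. This is a minimal illustration, assuming documents are identified by integer ids and that folds are dealt out from a seeded shuffle; the helper names make_folds and train_test_pairs are hypothetical, and the actual training of Maui is not shown.

```python
import random


def make_folds(doc_ids, k=10, seed=0):
    """Shuffle document ids and deal them into k non-overlapping folds."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]


def train_test_pairs(folds):
    """Yield (train_ids, test_ids), holding out each fold exactly once."""
    for held_out in range(len(folds)):
        train = [d for i, fold in enumerate(folds) if i != held_out
                 for d in fold]
        yield train, folds[held_out]


# Example with 290 documents, the size of CERN-290.
folds = make_folds(range(290), k=10)
pairs = list(train_test_pairs(folds))

# Each pair is trained twice, once on full texts and once on denoised
# texts, which gives 2 x 10 = 20 trained models.
models_to_train = [(variant, i)
                   for i in range(len(pairs))
                   for variant in ("full", "denoised")]
```

Because the denoised and noise parts are derived per document, the same fold assignment can be reused for all three text variants.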

Once the trained models are created, the indexers apply
them (the k-th full-text trained model and the k-th denoised
text trained model) to extract keyphrases from the k-th test
set composed of full texts and their denoised and noise
parts. According to the average number of keyphrases in
every document, we had the indexers extract 8 keyphrases,
7 keyphrases, and 15 keyphrases for each document in the
FAO-780, CERN-290, and NLM-500 test sets, respectively.
The extracted keyphrases are then compared against a gold
standard, namely the author-assigned keyphrases associated
with the test documents. The testing is carried out for the
rest of the folds and the performance measures
described in Section 3.3 are then averaged.
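
The comparison against the gold standard can be sketched as below. This is a minimal illustration that assumes exact string matching after lowercasing, which may differ from the phrase matching used in Maui's own evaluation; the function names are hypothetical.

```python
def precision_recall_f(extracted, gold):
    """Return (precision, recall, F-score) for one document."""
    extracted = {p.lower() for p in extracted}
    gold = {p.lower() for p in gold}
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


def mean_scores(scores):
    """Average a list of (precision, recall, F-score) triples column-wise."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))
```

Calling mean_scores on the per-fold triples gives the averaged figures referred to above.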



It is noteworthy that during training and testing, we used 
controlled vocabularies for the respective domains, and dur- 
ing testing we set the minimum and maximum length of 
the keyphrases to be extracted per document to 1 and 5, 
respectively, as this is the default setting of Maui. 

3.3 Performance Measures 

Figure 3: Text denoising threshold for FAO-780 dataset

In this experiment, we use the conventional quantitative
measures for performance evaluation: precision, recall and
F-score. In addition, we use three inter-indexing agreement
measures popularly used for qualitative indexing
assessment [10]. The measures are called Hooper's (H),
Rolling's (R) and Cosine (C) inter-indexing agreements. The
common property of these agreement measures is that they
provide the number of correct keyphrases in relation to the
size of the two sets of keyphrases being compared. We
briefly summarize these agreement measures for the reader's
convenience. If M and N are the number of idiosyncratic
keyphrases assigned by two indexers and O is the number of
phrases two indexers have in common, then Hooper's
measure [7] is

$$H(\mathit{indexer}_1, \mathit{indexer}_2) = \frac{O}{M + N - O}.$$

Similarly, Rolling's measure [13] is defined as

$$R(\mathit{indexer}_1, \mathit{indexer}_2) = \frac{2O}{M + N}.$$

The Cosine measure uses the geometric mean instead of Rolling's
arithmetic mean. Thus, the Cosine measure can be written as

$$C(\mathit{indexer}_1, \mathit{indexer}_2) = \frac{O}{\sqrt{M \cdot N}}.$$



The last two measures are almost identical unless the sets 
radically vary. It can be noted that Hooper's and Rolling's 
measures are identical to Jaccard's coefficient and the Dice 
coefficient, respectively, which are used to measure statisti- 
cal similarity between two sets. The closer the agreement 
measures are to 1, the more the indexers agree on extracted 
keyphrases. 
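
The three agreement measures can be computed as in the following sketch. It takes M and N to be the sizes of the two keyphrase sets and O the size of their intersection, which is the reading under which Hooper's measure coincides with Jaccard's coefficient; the function name agreement is hypothetical.

```python
import math


def agreement(phrases_a, phrases_b):
    """Hooper's, Rolling's and Cosine agreement between two keyphrase sets."""
    a, b = set(phrases_a), set(phrases_b)
    m, n, o = len(a), len(b), len(a & b)
    hooper = o / (m + n - o) if (m + n - o) else 0.0   # Jaccard coefficient
    rolling = 2 * o / (m + n) if (m + n) else 0.0      # Dice coefficient
    cosine = o / math.sqrt(m * n) if m and n else 0.0  # geometric-mean form
    return {"hooper": hooper, "rolling": rolling, "cosine": cosine}
```

For two sets of 4 and 9 phrases sharing 3, this gives H = 3/10, R = 6/13 and C = 3/6, showing how Rolling's and Cosine values drift apart as the set sizes diverge.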

We also calculate the error rates for every cross validation.
The error rate is defined as

$$E = \frac{FP + FN}{N},$$

where FP and FN are the number of false positives and
false negatives, respectively, and N is the total number of
instances in the test sets. The reasons for using the error
rates are twofold: first, to find a text denoising threshold, as
described in Section 3.4, and second, to perform a 10-fold cross
validated paired t-test [1] to report significant improvement
of Maui when paired with denoised texts. For any two given
sets of results, we consider their error rates to calculate a
paired t-value. If this calculated t-value lies outside ±2.26
with 9 degrees of freedom, then the difference between the
set whose results have the lower error rate and the other set
is said to be statistically significant at significance level
α = 0.05.
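
The error rate and the paired t-value over the 10 folds can be sketched as follows. This is a minimal illustration of a standard paired t-test on per-fold error rates; the function names are hypothetical, and the critical value 2.26 is the one stated above for 9 degrees of freedom.

```python
import math


def error_rate(fp, fn, n):
    """E = (FP + FN) / N for one test set."""
    return (fp + fn) / n


def paired_t(errors_a, errors_b):
    """Paired t-value for two equally long lists of per-fold error rates."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    k = len(diffs)
    mean = sum(diffs) / k
    var = sum((d - mean) ** 2 for d in diffs) / (k - 1)
    if var == 0:
        # All differences are identical: no spread to test against.
        return 0.0 if mean == 0 else math.copysign(math.inf, mean)
    return mean / math.sqrt(var / k)


def is_significant(t_value, critical=2.26):
    """Two-sided check at alpha = 0.05 with 9 degrees of freedom."""
    return abs(t_value) > critical
```

With 10 folds the test has k - 1 = 9 degrees of freedom, which is why the threshold of ±2.26 applies.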



3.4 Text Denoising Threshold 

To find the appropriate text denoising threshold for keyphrase 
indexing, we evaluate Maui's performance on each dataset 
by increasing the text denoising threshold in increments of 
10% from 30% to 90%. As we vary the threshold, we plot the 
error rates of Maui on different test sets. Because Maui
applies a supervised learning algorithm to develop its trained
models, the best-fitted model should be where the test
error has its global minimum. Therefore, the objective of this
plotting is to discover the global minimum with its Full-text
and Denoised-text trained models. The threshold at which this
global minimum occurs is taken as the denoising threshold.
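
The threshold search can be sketched as a simple sweep. This is an illustration only: sweep_thresholds is a hypothetical helper, and toy_error is a synthetic stand-in for a full train-and-test run of Maui at a given threshold; its values are not results reported in this paper.

```python
def sweep_thresholds(evaluate, thresholds=range(30, 100, 10)):
    """Evaluate the test error at each threshold and return the minimizer.

    `evaluate` maps a denoising threshold (in percent) to an error rate.
    """
    errors = {t: evaluate(t) for t in thresholds}
    best = min(errors, key=errors.get)
    return best, errors


def toy_error(threshold):
    """Synthetic U-shaped curve, used only to exercise the sweep."""
    return 0.2 + (threshold - 70) ** 2 / 10000.0


best, errors = sweep_thresholds(toy_error)
```

In practice evaluate would wrap the cross-validated error rate for one combination of trained model and test set.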

As an example, Figure 3 shows the error rates for different
denoising thresholds on the FAO-780 dataset. It is notable
that when we use the Denoised-text trained models to extract
keyphrases from either of the test sets, or use the Full-text
trained models on Denoised-text test sets, Maui has its global
minimum at 70% denoising (Figure 3a). From this point on,
the error rate increases and thus indicates an overfitting in
Maui's models. Figure 3b, on the other hand, shows that no
matter which trained model is used, full-text or denoised, the
error rate for noise test sets increases with increasing
thresholds. This indicates that noise texts are not content-rich
as Maui fails to extract a substantial number of keyphrases
from them. Figure 3 shows that Maui's best performing pair is
Denoised-Full, i.e., models trained with denoised texts and
used for keyphrase extraction from full texts.

Similarly, Maui's best-fitted models with denoised texts for
the CERN-290 dataset are also at the 70% threshold (Figure
4a). Maui's models, full-text or denoised, experience
overfitting after this threshold. Similar to what we observed for
the FAO-780 dataset, Maui has no improvement with either
of its trained models on the noise test set (Figure 4b). Maui
performs best on the CERN-290 dataset with its full-text
trained models extracting keyphrases from denoised texts
(Figure 4), unlike on FAO-780.

Maui's best-fitted models with denoised texts for the
NLM-500 corpus are also at the 70% threshold, except for the
full-text trained model on denoised-text test sets (Figure 5a).
In fact, Maui does not have any global minimum when it
applies a full-text trained model on denoised test sets until
its denoising threshold is set at 90%. Maui's performance
with noise texts from this domain is similar to that from the
other domains (Figure 5b). Like its performance on the
FAO-780 dataset, Maui performs best on full-text test sets with
its denoised text trained model for NLM-500.

Figure 4: Text denoising threshold for CERN-290 dataset

Figure 5: Text denoising threshold for NLM-500 dataset

These observations lead us to set the denoising threshold 
at 70%. At this threshold, Maui predicts keyphrases from 
unseen test examples most accurately. 

4. RESULTS AND DISCUSSIONS 

In this section, we discuss the performance of Maui with 
text denoising and compare this with its benchmark perfor- 
mance. 

Table 1a shows the precision, recall and F-score of Maui with
denoised texts and its benchmark performance on the
FAO-780 dataset. Maui, as it uses its denoised text and full-text
trained models on denoised text test sets, achieves F-scores
of 31.36 and 31.63, respectively, compared to its benchmark
F-score of 31.86. By applying the 10-fold cross validation
t-test, we see that for these two cases, the t-values are 2.23 and
1.81, respectively, which means that the differences between
the F-scores are not statistically significant at α = 0.05. In
other words, the benchmark performance of Maui cannot be
said to be different from that with text denoising. On the
other hand, Maui's F-score with its denoised text trained
model on full-text keyphrase extraction is 31.87. A
significance test shows that its t-value is 2.76, which indicates
that it is different at a significance level of α = 0.02. So, with
98% confidence we can say that the result is better than the
benchmark performance. In addition, from Table 2a we can
see that Maui's agreements with the gold standards are as
good as the benchmark agreements. This demonstrates that
the indexing quality of Maui has not been compromised with
text denoising.

For the CERN-290 dataset, although Maui could not
outperform its benchmark F-score of 24.99, none of the t-values
are significant at α = 0.05. In other words, its performance
with denoised texts cannot be said to be different from its
benchmark performance at a 95% confidence level. Its
best performance with denoised texts is when it uses the
full-text trained model to extract keyphrases from denoised
texts (F-score of 24.82). Maui's performance details on the
CERN-290 dataset are given in Table 1b.

(a) Maui's performance on FAO-780 dataset with 70% of the texts
(b) Maui's performance on CERN-290 dataset with 70% of the texts
(c) Maui's performance on NLM-500 dataset with 70% of the texts

Table 1: Precision, recall and F-score of Maui with text denoising

It is noteworthy that the performance of BibClassify, a
specialized keyphrase indexer developed by CERN for physics
documents, on the CERN-290 documents is 15.40 precision,
24.3 recall and 18.80 F-score [9]. If we compare this
performance with Maui (Table 1b), then we can see that using
70% of the text, Maui outperforms BibClassify.

Interestingly enough, although Maui agrees less with the
gold standard keyphrases for the CERN-290 dataset than
for FAO-780, its agreements on keyphrases with denoised
texts are as good as its benchmark performance. The details
of its inter-indexing agreement measures on the CERN-290
dataset are listed in Table 2b. As on FAO-780, we see that
the quality of the keyphrases Maui extracts from physics
documents when paired with text denoising is comparable
to that with full texts.

Maui's best performance for the NLM-500 corpus is with
a denoised text trained model on full texts. Its F-score of
31.50 outperforms the benchmark F-score of 31.13 at the
significance level of α = 0.05. However, although its other
two F-scores with denoised texts are somewhat lower than its
benchmark F-score, the differences are not statistically
significant, with t-values of 2.01 and 1.85 (Table 1c). Maui's
inter-indexing agreement on NLM-500 is somewhat similar to
that on FAO-780 except that it agrees more with the NLM-500
gold standards than its benchmark performance does (Table 2c).



5. CONCLUSIONS 

In this experiment, we consider full texts as well as their
denoised and noise parts from different domains like food and
agriculture, physics, and biomedical science. For each genre
of texts, we have seen that Maui's trained models overfit if
we set the denoising threshold beyond 70%. Considering this
as our denoising threshold for keyphrase indexing, we test
Maui on these texts, full and reduced.

From the experimental results, we show in this paper that
text denoising improves Maui's performance for the
biomedical science texts, or it allows Maui to perform as well as
its benchmark on the food and agriculture and the physics
texts. It does so by reducing Maui's training and test sets
to 70%. For instance, the FAO-780 dataset of 24 million
words has been reduced by text denoising to a set of 17
million words and Maui does not perform worse than its
benchmark. In other words, the 7 million removed words are
not potential keyphrase candidates, and text denoising, as
expected, excluded them from consideration.

Although there are some cases where Maui, when paired
with text denoising, experiences a marginally lower F-score
than its benchmark, indexing agreement measures show that
its indexing quality has never been compromised; it extracts
even better quality keyphrases from biomedical texts than
its benchmark.

It is noteworthy that during this experiment, we did not
change the way Maui works; rather, we tested it to see its
response on a set of reduced texts.

Therefore, our experimental findings reveal that 



• document size, per se, does not have the suggested ef- 
fect on keyphrase indexing — it is the content richness that 
plays the key role in indexing, 

• text denoising produces a content-rich set of sentences 
which can improve indexer performance, 

• the noise texts, i.e., the removed text, do not improve
indexing; rather, they increase the error rates,

• text denoising is useful not only for biomedical relation
extraction but also for keyphrase indexing, and

• text denoising can be used for domains other
than biomedical science.

                                                          Benchmark Performance
Trained Model  Test Set       Hooper  Rolling  Cosine     Hooper  Rolling  Cosine
Denoised Text  Denoised Text  0.18    ...      ...        0.18    0.30     0.31
Denoised Text  Full Text      0.18    0.30     0.31
Full Text      Denoised Text  0.18    0.30     0.30

(a) Maui's indexing agreements on FAO-780 dataset with 70% of the texts

                                                          Benchmark Performance
Trained Model  Test Set       Hooper  Rolling  Cosine     Hooper  Rolling  Cosine
Denoised Text  Denoised Text  0.14    0.24     0.24       0.14    0.24     0.24
Denoised Text  Full Text      0.14    0.24     0.24
Full Text      Denoised Text  0.14    0.24     0.24

(b) Maui's indexing agreements on CERN-290 dataset with 70% of the texts

                                                          Benchmark Performance
Trained Model  Test Set       Hooper  Rolling  Cosine     Hooper  Rolling  Cosine
Denoised Text  Denoised Text  0.18    0.30     0.30       0.18    0.30     0.31
Denoised Text  Full Text      0.19    0.31     0.31
Full Text      Denoised Text  0.18    0.30     0.30

(c) Maui's indexing agreements on NLM-500 dataset with 70% of the texts

Table 2: Inter-indexing agreements of Maui with text denoising



With these results in mind and recalling that there are other 
indexers that use different features to train their machine 
learning models, we are interested in further investigating 
the effect of text denoising on them. In addition, when
paired with text denoising, Maui performs better on
biomedical texts than on texts from agriculture and physics.
Because text denoising was originally developed for relation
mining in biomedical texts, we are also interested in exploring
the reasons behind the success of text denoising for keyphrase
indexing in this domain. These investigations are left for
future work.